Improving speech recognition and keyword search for low resource languages using web data

Authors

  • Gideon Mendels
  • Erica Cooper
  • Victor Soto
  • Julia Hirschberg
  • Mark J. F. Gales
  • Kate Knill
  • Anton Ragni
  • Haipeng Wang
Abstract

We describe the use of text data scraped from the web to augment language models for Automatic Speech Recognition and Keyword Search for low-resource languages. We scrape text from multiple genres, including blogs, online news, translated TED talks, and subtitles. Using linearly interpolated language models, we find that blogs and movie subtitles are more relevant than the other genres for language modeling of conversational telephone speech, and that they yield large reductions in out-of-vocabulary keywords. Furthermore, we show that the web data can improve Term Error Rate performance by 3.8% absolute and Maximum Term-Weighted Value in Keyword Search by 0.0076 to 0.1059 absolute points. Much of the gain comes from the reduction of out-of-vocabulary items.
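
The two ideas the abstract leans on, linear interpolation of language models and the out-of-vocabulary (OOV) rate, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes unigram models with add-one smoothing for brevity (the paper does not state its n-gram order or smoothing here), fixed interpolation weights rather than weights tuned on held-out data, and toy sentences invented for illustration. All function and variable names are hypothetical.

```python
from collections import Counter


def unigram_probs(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(models, weights):
    """Linear interpolation: P(w) = sum_i weights[i] * P_i(w)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocab = models[0].keys()
    return {w: sum(lam * m[w] for lam, m in zip(weights, models)) for w in vocab}


def oov_rate(test_tokens, vocab):
    """Fraction of test tokens that are missing from the vocabulary."""
    if not test_tokens:
        return 0.0
    return sum(1 for w in test_tokens if w not in vocab) / len(test_tokens)


# Toy data (invented): a small in-domain transcript and some scraped web text.
in_domain = "hello how are you".split()
web = "hello friend how is the movie".split()
test = "how is the movie friend".split()

vocab_base = set(in_domain)          # vocabulary from in-domain data only
vocab_aug = vocab_base | set(web)    # vocabulary augmented with web data

p_in = unigram_probs(in_domain, vocab_aug)
p_web = unigram_probs(web, vocab_aug)

# Assumed weights; in practice they would be tuned on held-out data.
mixed = interpolate([p_in, p_web], [0.7, 0.3])

print(oov_rate(test, vocab_base))  # OOV rate before adding web data
print(oov_rate(test, vocab_aug))   # OOV rate after adding web data
```

In this toy setup the augmented vocabulary covers every test token, which mirrors the abstract's point that much of the gain comes from fewer OOV items. The interpolated distribution still sums to one because each component model is normalized over the same vocabulary.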

Similar resources

Babler - Data Collection from the Web to Support Speech Recognition and Keyword Search

We describe a system to collect web data for Low Resource Languages, to augment language model training data for Automatic Speech Recognition (ASR) and keyword search by reducing the Out-of-Vocabulary (OOV) rate, that is, the share of words in the test set that did not appear in the ASR training set. We test this system on seven Low Resource Languages from the IARPA Babel Program: Paraguayan Guarani, Igbo, Amha...

A comparison of multiple methods for rescoring keyword search lists for low resource languages

We review the performance of a new two-stage cascaded machine learning approach for rescoring keyword search output for low resource languages. In the first stage Confusion Networks (CNs) are rescored for improved Automatic Speech Recognition (ASR) by reranking the arcs of each confusion bin. In the second stage we generate keyword search hypotheses from the rescored ASR output and rescore them...

Spoken Keyword Rescoring and Document Retrieval for Low-resource Languages

For languages that have adequate data for automatic speech recognition (ASR), many keyword search (KWS) and spoken document retrieval (SDR) systems have been developed with near-optimal performance. However, without sufficient training data to produce high-accuracy transcripts, identifying and retrieving queries in speech data from low-resource languages remains challenging. To compensate for th...

Developing Keyword Search under the IARPA Babel Program

Spoken content in languages of emerging importance needs to be searchable to provide access to the underlying information. Keyword search (KWS), also known as spoken term detection (STD), is a speech processing task in which the goal is to find all the occurrences of a textual “keyword”, a sequence of one or more words, in a large corpus of speech data. In 2006, the U.S. National Institute of S...

Strategies for rescoring keyword search results using word-burst and acoustic features

The identification of keyword queries in speech data from low-resource languages poses a challenge for current methods, as speech recognition algorithms lack sufficient training data to produce high-accuracy transcripts. To compensate for these shortcomings, we extract signals from the data that are useful in keyword identification but are not being used by the speech recognizer. These signals ta...

Publication date: 2015